LS4003 R tutorial 1

Introduction to R programming

Why R?

R is a computer programming language that is heavily used across Life Science disciplines. It was built especially to make it easy to visualize large datasets and run statistical tests.

The skills you will learn in these workshops will be useful if you go into research, or transferable to many other careers.

Install and set-up options

To practice R, we’re going to use RStudio, a visual development environment where you can see your code, your outputs and your graphs side by side.

You have three ways of running R and RStudio:

  1. Using AppsAnywhere on a University Computer
  2. Free online account on Posit Cloud
  3. Install on your own computer

Follow the instructions below for the option you are using.

Option 1: AppsAnywhere on a University computer

First, go to your OneDrive folder in file explorer, and create a new folder called “LS4003_Statistics”.

Next, go to AppsAnywhere and load RStudio (R will install automatically).

Inside RStudio, under “console” (bottom left panel), locate your OneDrive by typing:

setwd("O:/")

Then click the ‘More’ cog icon (bottom right panel) and select “Go to Working Directory”.

Find and click on your folder (LS4003_Statistics) to enter it.

Set that as your final working directory by clicking on the ‘More’ cog icon again and selecting “Set as Working Directory”.

Option 2: Posit Cloud

This is an online, cloud-based option. It’s a bit more limited than running on a university computer or your own computer, but the free option should be enough for this module.

Go to Posit Cloud and create a free account

Log in, then go to New Project -> New RStudio Project.

Make a new folder in the bottom right panel (by clicking the New Folder button) called “LS4003_Statistics”.

Click on this folder to enter it, and then click the More cog (bottom right panel) and select “Set as Working Directory”.

Option 3: Your own computer

To run R on your own machine, you have to install R (the programming language) and RStudio (the development environment).

When installing, click the most appropriate option for your machine (Windows/Mac/Linux)

Install R

Install RStudio

Once you have installed both, open RStudio.

Navigate to your Documents folder in the bottom right panel. (If you can’t find it, type setwd("~/Documents") into the console on the bottom left, then click the More cog on the bottom right and select “Go to Working Directory”.)

Create a new folder called LS4003_Statistics by clicking the New Folder button on the right hand side.

Click on your folder (LS4003_Statistics) to enter it.

Set that as your final working directory by clicking on the ‘More’ cog icon again and selecting “Set as Working Directory”.

RStudio and the four panels

RStudio is split into the following four panels:

  • Top-left: Create and edit your R programming files
  • Top-right: See your data (tables and values)
  • Bottom-left: Console to run one or two lines of code
  • Bottom-right: View your folders and plots (different tabs)

Now we want to create our first R program.

From the menu bar, select File -> New File -> R Script. The new file opens in the top-left panel.

Save the file by going File -> Save or using the shortcut Control + S and naming the file:

"R tutorial 1".

It will automatically add the “.R” extension so we know it’s an R file - “R tutorial 1.R”

Install and load packages

In R we use many packages - these are extensions that contain the code we need to do cool things like plot graphs. There are many packages available for different uses, which means you don’t have to write everything from scratch!

Install the packages we will be using. In the console (bottom left corner), type in the following, one line at a time:

  a. install.packages('ggplot2')
  b. install.packages('ggpubr')
  c. install.packages('ggthemes')
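Installing a package only has to be done once, but a package must be loaded with library() in every new R session before its functions can be used. A minimal sketch, assuming the three installs finished without errors:

```r
# Load the three packages for this session.
# Put these lines at the top of your R file so they run first.
library(ggplot2)   # plotting functions such as ggplot()
library(ggpubr)    # extra helpers built on top of ggplot2
library(ggthemes)  # additional themes for ggplot2 graphs
```

If R replies "there is no package called ...", that package has not been installed yet.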

R basics

Variables in R

We can write and run code in our R file, and then see the result (output) in our console.

Copy the following code to your R file, highlight both lines and click run.
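Going by the line-by-line breakdown further down this section, the two lines are presumably these (the comments are added here for explanation):

```r
# Add 5 and 10, and save the result under the name a
a <- 5 + 10

# Print the value stored in a to the console
a
```

The console should show [1] 15, where [1] simply labels the first value of the output.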

Note: RStudio by default only runs the line you’ve selected. To run, you’ll need to highlight both lines and click “Run”, or click the arrow next to “Run” and select “Run all” which will run all the code in your file.

The variable you create will appear in the top right panel. What value does it have? Is this what you expect?

The result of the code will also appear in the console (bottom left panel).

Let’s break that down:

  • a <- 5 + 10
    • a : This is our variable name. A variable is a piece of data with a name so we can find it again.
    • <- : This is the assignment operator, equivalent to “=”. It means the data on the right hand side is saved under the variable name on the left hand side.
    • 5 + 10 : Mathematical operations in R have a set format. Here + adds the two numbers; if you want to multiply two numbers, you would use an asterisk (*).
  • a
    • a : This line will get the value from our variable a, and print it to the console. It’s really useful for checking if a value is what you expect, especially when you have many variables!

Comments in R

If you put a hash # at the start of a line, the line will be ignored and not run as code.

We use comments to add notes - to anyone else using our code, or to our future selves (you will forget what your code does!)
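A short illustration of both ways to write a comment. The variable names and numbers are made up for this sketch and are not part of the module data:

```r
# Calculate the area of a rectangle (this whole line is a comment)
width <- 4
height <- 2.5

area <- width * height  # a comment can also follow code on the same line

# area <- 0   <- this line is "commented out", so R skips it
area
```

Only the lines without a leading # are run, so area ends up as 10.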

Error messages in R

Programming code is its own language, but with really strict grammar. If I mistakes make in sentrances, like I do in this one, you can still understand the meaning.

If we make a mistake in code, even a typo, the computer will not understand.

However, it does try to be helpful. We get an Error message in the console (bottom left panel) that gives us a clue as to what the problem is.

Try with the below code. What does the error message say? Can you try and fix it?
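As a separate, hypothetical illustration (not the exercise snippet itself), here is one of the most common errors: a typo in a variable name. The names total and totl are invented for this sketch. Typing totl * 2 directly would stop with "Error: object 'totl' not found"; the sketch wraps it in tryCatch(), a base R function that catches the error, so that the rest of the script keeps running and we can look at the message.

```r
total <- 5 + 10

# Deliberate typo: the variable is called total, not totl.
# tryCatch() runs the first expression and, if it fails,
# hands the error to our function, which returns its message as text.
message_text <- tryCatch(
  totl * 2,
  error = function(e) conditionMessage(e)
)
message_text

# The fix is to use the name exactly as it was created
total * 2
```

The message names the object R could not find, which points straight at the typo.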

Vectors in R

After variables, the next thing we need to understand is vectors. Vectors are a bit like lists, but they can only hold one type of value, such as all integers (whole numbers) or all strings (words in “quotes”).

To make a vector, we need to use the c() function
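A minimal sketch of c() in use. Only the name my_numbers and the value 9 come from this tutorial; the other values and the my_words vector are made up for illustration:

```r
# A vector of numbers
my_numbers <- c(2, 4, 9)

# A vector of strings (each word in quotes)
my_words <- c("locust", "otter", "mouse")

# Print both vectors to the console
my_numbers
my_words

# class() reports what type of value a vector holds
class(my_numbers)
class(my_words)
```

Both vectors will also appear under Values in the top right panel.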

What happens to the numbers in the my_numbers vector if you change 9 to 9.5?

Functions in R

We won’t be writing our own R functions, but we’ll be using in-built functions and functions from libraries.

A function is like a mini-program - it’s a set of code that somebody else has written that we can apply to our own variables.

These work using the following structure: function_name(arguments) where:

  • function_name is the name of the function
  • arguments are the values we are putting in, or any options we want to change

See examples below using some in-built functions:
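A few in-built functions that each take a single argument, as a sketch. The values vector is made up for illustration:

```r
values <- c(4, 8, 15, 16, 23, 42)

length(values)  # how many values are in the vector: 6
sum(values)     # all the values added together: 108
max(values)     # the largest value: 42
sqrt(16)        # the square root of a single number: 4
```

In each case the function name comes first and the argument goes inside the brackets.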

We can also give multiple arguments by separating them with commas.

In the example below, we’re using the seq() function to generate a sequence of numbers. We want the sequence to start at 0, end at 20, and go up by 2 every number.
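From that description, the call would look something like this (the variable name even_numbers is an assumption made for this sketch):

```r
# Start at 0, stop at 20, go up by 2 each time
even_numbers <- seq(0, 20, by = 2)
even_numbers
```

The first two arguments are matched by position (start, then end), while by = 2 is matched by name.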

Give it a go:

From what you’ve learnt so far, can you:

  1. Generate every third number from 1 to 100 (1, 4, 7…)
  2. Find the mean of those numbers
  3. Find the standard deviation

Hint 1

Consider using the seq() function

seq(start value, stop value, by=increase)

Hint 2

Consider using the functions seq(), mean() and sd()

variablename <- seq(start value, stop value, by=increase)
mean(variablename)
sd(variablename)

R Dataframes

Create a dataframe

We’ve looked at single values, and multiple values in a vector. The next step is to look at tables of data - we can do this with a dataframe. You’ll commonly see a dataframe variable named df.
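A sketch of how a dataframe can be built with the data.frame() function, where each argument is one column given as a vector. The names Data_Frame, Pulse and Duration are the ones used in this tutorial; the Training column and every value are assumptions made for illustration.

```r
# Each column is a vector, and all columns must be the same length
Data_Frame <- data.frame(
  Training = c("Strength", "Stamina", "Other"),  # made-up labels
  Pulse    = c(100, 150, 120),                   # made-up values
  Duration = c(60, 30, 45)                       # made-up values
)

# Print the table to the console
Data_Frame

# Use $ to pull out a single column as a vector
Data_Frame$Pulse
```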

If you look in your environment (top right panel) you’ll now see your Data_Frame variable. If you click it, this will open it in a new tab next to your R script.

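We can use the built-in function summary() to create an overview of our data. For our numerical columns, this gives the minimum, first quartile, median, mean, third quartile and maximum. The interquartile range is the gap between the first and third quartiles.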

Summarise the data
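A sketch of the summary step, using a small dataframe with made-up Pulse and Duration values so that it runs on its own:

```r
# Made-up values for illustration only
Data_Frame <- data.frame(
  Pulse    = c(100, 150, 120),
  Duration = c(60, 30, 45)
)

# Overview of every column: min, quartiles, median, mean and max
summary(Data_Frame)

# The same quartiles for one column
quantile(Data_Frame$Pulse)

# Interquartile range = third quartile minus first quartile
IQR(Data_Frame$Pulse)
```

With these made-up Pulse values the quartiles are 110 and 135, so the interquartile range is 25.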

This is already all the information we need to create a box plot!

Draw a box plot

We’re going to use a library of functions called ggplot2 to make our box plot.

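Before we can use the library, it needs to be installed and loaded. If you did not install it earlier, copy the following line into the console (bottom left section): install.packages('ggplot2'). Then add library(ggplot2) at the top of your R file so the plotting functions are available.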
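Going by the breakdown that follows, the box plot code is presumably the two plotting lines below. This sketch assumes ggplot2 is installed, and it rebuilds a small Data_Frame with made-up values so that it runs on its own; the name pulse_plot is also an assumption.

```r
library(ggplot2)

# Made-up values for illustration only
Data_Frame <- data.frame(
  Pulse    = c(100, 150, 120),
  Duration = c(60, 30, 45)
)

# Choose the data and the y values, then draw them as a box plot
pulse_plot <- ggplot(Data_Frame, aes(y = Pulse)) +
  geom_boxplot()

# Show the plot in the bottom right panel
pulse_plot
```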

Do the values match what we got in the summary() function?

Let’s break that down:

  • ggplot(Data_Frame, aes(y = Pulse)) +
    • ggplot() : This is our plotting function
    • Data_Frame: This is our argument of what data we want to use
    • aes(): aes is aesthetics, which allows us to choose our groups for x, y or categories
    • y = Pulse: this is setting our y value to be the pulse column
  • geom_boxplot()
    • This tells R to use the data we chose to make a boxplot

Styling the graphs

We can use the + to add new lines to our graph. This allows us to add more styling information.

You want a + at the end of every line except the last one. A + tells R that the plot continues on the next line; leaving it off the last line tells R the plot is finished.
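A sketch of a styled box plot, assuming ggplot2 is installed. ggtitle(), ylab() and theme_classic() are standard ggplot2 functions; the title text, the axis label and the data values are made up for illustration.

```r
library(ggplot2)

# Made-up values for illustration only
Data_Frame <- data.frame(
  Pulse    = c(100, 150, 120),
  Duration = c(60, 30, 45)
)

styled_plot <- ggplot(Data_Frame, aes(y = Pulse)) +
  geom_boxplot() +                  # draw the box plot
  ggtitle("Pulse during training") +  # add a title above the graph
  ylab("Pulse") +                   # label the y axis
  theme_classic()                   # last line: no + at the end

styled_plot
```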

Try it yourself

Can you change the code to make a boxplot of the Duration values?

Hint 1

The only parts you need to change are the aes() aesthetics and ggtitle().

R from Excel

Import an Excel table as a dataframe

Next we’re going to take tabular data from Excel and import it as our data.

I’ve put a CSV file on the Canvas page for you to download and copy into your folder. A CSV file is like a plain Excel table, but without the Microsoft formatting. This makes it easier for us to use.

You can turn any Excel file into a CSV file by going File -> Save As -> File format: CSV. This only saves one sheet; if your file has multiple sheets, the others will be lost!

The dataset we are going to use is LocustSerotonin.csv, which contains serotonin levels in the central nervous system of desert locusts that were experimentally crowded for 0 (the control group), 1, and 2 hours. This is from Chapter 2 of The Analysis of Biological Data.
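A hedged sketch of the import step. With the real file in your working directory, the import is a single line: locust_df <- read.csv("LocustSerotonin.csv"). So that the sketch runs anywhere, it instead writes a tiny stand-in CSV to a temporary file and reads that back. The three column names are the ones used in this tutorial; the three rows of values are made up and are not the real data.

```r
# Write a small stand-in CSV file (made-up values, NOT the real dataset)
demo_path <- tempfile(fileext = ".csv")
writeLines(c(
  "serotoninLevel,treatmentTime,serotoninCategory",
  "5.3,0,Low",
  "9.1,1,Medium",
  "14.2,2,High"
), demo_path)

# read.csv() turns the file into a dataframe; the first row becomes the column names
demo_df <- read.csv(demo_path)

head(demo_df)  # show the first rows of the table
str(demo_df)   # show each column and the type of value it holds
```

head() and str() are a quick way to check that a file has loaded as expected before plotting it.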

As you can see, we have a dataset of three columns: the first is serotonin levels in individual locusts, the second is their treatment group (0, 1 or 2 hours of being crowded), and the third is a ranking of their serotonin level as “Low”, “Medium”, or “High”.

Next, we’re going to use boxplots to compare the serotonin level for different treatment groups.
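Pieced together from the line-by-line breakdown further down, the code is presumably as follows. This sketch assumes ggplot2 is installed and that LocustSerotonin.csv has been copied into your working directory; the name locust_plot is an assumption.

```r
library(ggplot2)

# Load the dataset (the file must be in your working directory)
locust_df <- read.csv("LocustSerotonin.csv")

# One box per treatment time, showing the spread of serotonin levels
locust_plot <- ggplot(locust_df, aes(x = treatmentTime, y = serotoninLevel, group = treatmentTime)) +
  geom_boxplot()

locust_plot
```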

Question: What happens if you remove “group = treatmentTime” from the above code? What is this doing?

Let’s break that down line by line:

  • locust_df <- read.csv("LocustSerotonin.csv") - This is loading in our dataset. You should now see “locust_df” in your environment tab (top right corner)
  • ggplot(locust_df, aes(x = treatmentTime, y = serotoninLevel, group = treatmentTime)):
    • ggplot() is the graphing function we are using
    • locust_df is the data we want to plot
    • aes() is our aesthetics - what are our x and y values?
    • group: Our treatmentTime column is treated as continuous data (as it is numerical). We want this to be discrete (three separate groups), so we explicitly tell ggplot to organise this data into groups.
  • geom_boxplot(): this is specifying we want to plot a box plot of our data

Discrete versus continuous scale

Here we have three distinct treatment groups. This is a discrete scale - no locusts were crowded for 1.5 hours for instance.

Because treatmentTime is numerical, ggplot automatically treats it as a continuous scale. We can see this when we use fill = to colour the boxplots by treatmentTime: the legend shows a colour gradient rather than three separate colours.

To make the same dataset use a discrete scale for this value, we can wrap the column in the as.factor() function. We can also change the title of our legend using guides().
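What as.factor() does can be seen without drawing a plot. In this sketch the treatment_time vector is made up for illustration:

```r
# Made-up treatment times in hours
treatment_time <- c(0, 0, 1, 1, 2, 2)

# As plain numbers, R sees a continuous scale
class(treatment_time)

# As a factor, R sees a fixed set of separate groups
treatment_group <- as.factor(treatment_time)
class(treatment_group)
levels(treatment_group)
```

Inside a plot, the same idea would be written as fill = as.factor(treatmentTime) within aes(), and a legend title can be set with guides(fill = guide_legend(title = "...")), where the title text is whatever you choose.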

Violin plots

Violin plots are similar to box plots, but they show more information on what the distribution is. You’ll see in the worksheet example how a violin plot can sometimes give more useful information than a box plot.

To change from a box plot to a violin plot, you only need to change from geom_boxplot() to geom_violin().

Question: If you add the line + geom_point(), what happens?

Bar plot by counts

Sometimes we don’t want to plot numerical values (such as the serotonin level) that we already have in our dataset. If we instead want to plot how many entries fall into each category, we can make a bar plot like in the below example:

Here we have:

  1. Set locust_df$serotoninCategory as a factor with levels. This puts the categories in the order we want on the graph. What happens if you comment this out?

  2. Used geom_bar(stat = 'count') - for each category on the x axis, this counts the number of entries. Without using fill, this would just make a bar of height 10 for all three groups, as there were equal numbers in each treatment group.

  3. The fill is used to colour the sections by serotonin category.
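The first two points can be tried out in base R without a plot. In this sketch the serotonin_category vector is made up for illustration; table() does the same kind of counting that stat = 'count' does for the bars.

```r
# Made-up categories for illustration only
serotonin_category <- c("Low", "High", "Medium", "Low", "High", "Medium")

# Without levels, a factor is ordered alphabetically
levels(factor(serotonin_category))

# With levels, we choose the order ourselves
ordered_category <- factor(serotonin_category, levels = c("Low", "Medium", "High"))
levels(ordered_category)

# Count how many entries fall into each category
table(ordered_category)
```

The alphabetical order puts "High" before "Low", which is why the levels are set by hand.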

Visualising distributions

We can also draw histograms to visualise distributions.

Is this a normal distribution?

It’s a small dataset so it’s hard to tell, but we’ll go through more examples in the worksheet.

That’s the end of the tutorial!
